Abstract


This paper contributes a novel embedding model which measures the probability
of each candidate belief (h, r, t, m) in a large-scale knowledge repository by
simultaneously learning distributed representations for entities (h and t),
relations (r), and the words in relation mentions (m). It facilitates knowledge
completion by means of simple vector operations to discover new beliefs. Given
an imperfect belief, we can not only infer the missing entities and predict the
unknown relations, but also tell the plausibility of that belief, just by
exploiting the learnt embeddings of available evidence. To demonstrate the
scalability and the effectiveness of our model, we conduct experiments on
several large-scale repositories which contain hundreds of thousands of beliefs
from WordNet, Freebase and NELL, and compare the results on three tasks (entity
inference, relation prediction and triplet classification) with cutting-edge
approaches. Extensive experimental results show that the proposed model
outperforms other state-of-the-art methods, with significant improvements
identified.


1 Introduction 


Information extraction (Sarawagi, 2008; Grishman, 1997), the study of
extracting structured beliefs from unstructured online texts to populate
knowledge bases, has drawn much attention in recent years because of the
explosive growth in the number of web pages. Thanks to the long-term efforts of
experts, crowdsourcing, and even machine learning techniques, several web-scale
knowledge repositories, such as WordNet, Freebase and NELL, have been
constructed. WordNet (Miller, 1995) and Freebase (Bollacker et al., 2007;
Bollacker et al., 2008) follow the RDF format that represents each belief as a
triplet, i.e. (head entity, relation, tail entity). NELL (Carlson et al.,
2010a) extends each triplet with a relation mention, which is a snatch of
extracted free text that indicates the corresponding relation. Here we take a
belief recorded in NELL as an example: (city:Caroline, citylocatedinstate,
stateorprovince:maryland, county and state of), in which "county and state of"
is the mention between the head entity city:Caroline and the tail entity
stateorprovince:maryland that indicates the relation citylocatedinstate. In
some cases, NELL also provides the confidence of each belief, automatically
learnt by machines.

Although colossal quantities of beliefs have been gathered, state-of-the-art
work (West et al., 2014) reports that our knowledge bases are far from
complete. For instance, nearly 97% of people in the Freebase database have no
records about their parents, whereas we can still find clues as to their
immediate family in many cases by searching on the web and looking up their
Wiki pages.

To populate the incomplete knowledge repositories, researchers either compare
the performance of relation extraction between two named entities on manually
annotated text datasets, such as ACE and MUC, or look for effective approaches
for improving the accuracy of link prediction within the knowledge graphs
constructed by the repositories without using extra free texts.

Recently, studies on text-based knowledge completion have benefited
significantly from a paradigm known as Distantly Supervised Relation Extraction
(DSRE) (Mintz et al., 2009), which bridges the gap between structured knowledge
bases and unstructured free texts. It alleviates the labor of manual annotation
by automatically aligning each triplet (h, r, t) from knowledge bases to the
corresponding relation mention m in free texts. However, state-of-the-art
research (Fan et al., 2014a) points out that DSRE still suffers from the
problem of sparse and noisy features. Although Fan et al. (2014a) fix the issue
to some extent by making use of low-dimensional matrix factorization, their
approach was identified as unable to handle large-scale datasets.

Footnotes:
WordNet: http://wordnet.princeton.edu/
ACE: http://www.itl.nist.gov/iad/mig/tests/ace/
MUC: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/

Fortunately, knowledge embedding techniques (Bordes et al., 2011; Bordes et
al., 2014) enable us to encode the high-dimensional sparse features into
low-dimensional distributed representations. A simple but effective model is
TransE (Bordes et al., 2013), which trains a vector representation for each
entity and relation in large-scale knowledge bases without considering any text
information. Even though Weston et al. (2013), Wang et al. (2014a) and Fan et
al. (2015a) broaden this field by adding word embeddings, there is still no
comprehensive and elegant model that can integrate such large-scale
heterogeneous resources to satisfy multiple subtasks of knowledge completion,
including entity inference, relation prediction and triplet classification.


Therefore, the contribution of this paper is a novel embedding model which
measures the probability of each belief (h, r, t, m) in large-scale
repositories. It breaks through the limitation of heterogeneous data, and
establishes the connection between structured knowledge graphs and unstructured
free texts. The distributed representations for entities (h and t), relations
(r), as well as the words in relation mentions (m) are simultaneously learnt
within the uniform framework of the probabilistic belief embedding (PBE) we
propose. Knowledge completion is facilitated by means of simple vector
operations to discover new beliefs. Given an imperfect belief, we can not only
infer the missing entities and predict the unknown relations, but also tell the
plausibility of the belief, just by means of the learnt vector representations
of available data. To demonstrate the effectiveness and the scalability of PBE,
we perform extensive experiments on multiple knowledge completion tasks,
including entity inference, relation prediction and triplet classification, and
evaluate both our model and other cutting-edge approaches with appropriate
metrics on several large-scale datasets which contain hundreds of thousands of
beliefs from WordNet, Freebase and NELL. Detailed comparison results
demonstrate that the proposed model outperforms other state-of-the-art
approaches, with significant improvements identified.


2 Related Work 


Embedding-based inference models usually design various scoring functions
f_r(h, t) to measure the plausibility of a triplet (h, r, t). The lower the
dissimilarity given by the scoring function f_r(h, t) is, the higher the
compatibility of the triplet will be.

Unstructured (Bordes et al., 2013) is a naive model which exploits the
occurrence information of the head and the tail entities without considering
the relation between them. It defines a scoring function ||h - t||, and this
model obviously cannot discriminate between a pair of entities involving
different relations. Therefore, Unstructured is commonly regarded as the
baseline approach.

Distance Model (SE) (Bordes et al., 2011) uses a pair of matrices
(W_{rh}, W_{rt}) to characterize a relation r. The dissimilarity of a triplet
is calculated by ||W_{rh} h - W_{rt} t||_1. As identified by Socher et al.
(2013), the separate matrices W_{rh} and W_{rt} weaken the capability of
capturing correlations between entities and corresponding relations, despite
the model taking the relations into consideration.

Single Layer Model, proposed by Socher et al. (2013), thus aims to alleviate
the shortcomings of the Distance Model by means of the non-linearity of a
single layer neural network g(W_{rh} h + W_{rt} t + b_r), in which g = tanh.
The linear output layer then gives the scoring function:
u_r^T g(W_{rh} h + W_{rt} t + b_r).


Bilinear Model (Sutskever et al., 2009; Jenatton et al., 2012) is another model
that tries to fix the issue of weak interaction between the head and tail
entities caused by the Distance Model with a relation-specific bilinear form:
f_r(h, t) = h^T W_r t.

Neural Tensor Network (NTN) (Socher et al., 2013) designs a general scoring
function: f_r(h, t) = u_r^T g(h^T W_r t + W_{rh} h + W_{rt} t + b_r), which
combines the Single Layer and Bilinear Models. This model is more expressive as
the second-order correlations are also considered in the non-linear
transformation function, but the computational complexity is rather high.

TransE (Bordes et al., 2013) is a canonical model different from all the other
prior art, which embeds relations into the same vector space as entities by
regarding the relation r as a translation from h to t, i.e. h + r = t. It works
well on beliefs with the ONE-TO-ONE mapping property but performs badly with
multi-mapping beliefs. Given a series of facts associated with a ONE-TO-MANY
relation r, e.g. (h, r, t_1), (h, r, t_2), ..., (h, r, t_m), TransE tends to
represent the embeddings of entities on the MANY-side as extremely close to
each other with very little discrimination.
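The translation idea can be illustrated with a minimal sketch. The vectors below are invented toy values, and the L1 norm is an assumption made for the example; the point is only that two valid tails of the same (h, r) pair are both pulled towards h + r, and therefore towards each other.

```python
# Illustrative TransE-style dissimilarity; all vectors are made-up toy values.

def transe_dissimilarity(h, r, t):
    """L1 distance between the translated head (h + r) and the tail t."""
    return sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))

head = [0.1, 0.3]
relation = [0.4, -0.1]
tails = {
    "t1": [0.5, 0.2],    # sits at h + r
    "t2": [0.5, 0.25],   # a second valid tail has to sit almost on top of t1
    "t3": [-0.6, 0.9],   # an unrelated entity
}

# Lower dissimilarity means a more plausible triplet.
ranked = sorted(tails, key=lambda name: transe_dissimilarity(head, relation, tails[name]))
```

Here `ranked` lists the candidate tails from most to least plausible under this toy scoring function.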

TransM (Fan et al., 2014b) exploits the structure of the whole knowledge graph,
and adjusts the learning rate, which is specific to each relation, based on the
multiple mapping property of the relation.

TransH (Wang et al., 2014b) is, to the best of the authors' knowledge, the
state-of-the-art approach. It improves TransE by modeling a relation as a
hyperplane, which makes it more flexible with regard to modeling beliefs with
multi-mapping properties.

Due to the diverse feature spaces of unstructured texts and structured beliefs,
one key challenge of connecting natural language and knowledge is to project
the features into the same space and to merge them together for knowledge
completion. Fan et al. (2015a) have recently proposed JRME to jointly learn the
embedding representations for both relations and mentions in order to predict
unknown relations between entities in NELL. However, the functionality of their
latest method is limited to the relation prediction task (see Section 5.3), as
the correlations between entities and relations are ignored. Therefore, we
desire a comprehensive model that can simultaneously consider entities,
relations and even the relation mentions, and can integrate the heterogeneous
resources to support multiple subtasks of knowledge completion, such as entity
inference, relation prediction and triplet classification.


3 Model 

The intuition of the subsequent theory is that not every belief we have learnt,
i.e. (head entity, relation, tail entity, mention), abbreviated as (h, r, t, m),
is perfect and complete enough (Fan et al., 2015b). We thus explore modeling
the probability of each belief, i.e. Pr(h, r, t, m). It is assumed that
Pr(h, r, t, m) is collaboratively influenced by Pr(h|r, t), Pr(t|h, r) and
Pr(r|h, t, m), where Pr(h|r, t) stands for the conditional probability of
inferring the head entity h given the relation r and the tail entity t,
Pr(t|h, r) represents the conditional probability of inferring the tail entity
t given the head entity h and the relation r, and Pr(r|h, t, m) denotes the
conditional probability of predicting the relation r between the head entity h
and the tail entity t with the relation mention m extracted from free texts.
Therefore, we define the probability of a belief as the geometric mean of
Pr(h|r, t), Pr(r|h, t, m) and Pr(t|h, r), as shown in the subsequent equation,

    \Pr(h, r, t, m) = \sqrt[3]{\Pr(h \mid r, t)\,\Pr(r \mid h, t, m)\,\Pr(t \mid h, r)}.    (1)
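As a small numerical illustration of Equation (1), the sketch below computes the geometric mean directly and in log space. The function names and the three input probabilities are hypothetical, chosen only for this example.

```python
import math

def belief_probability(p_head, p_relation, p_tail):
    """Geometric mean of the three conditional probabilities, as in Equation (1)."""
    return (p_head * p_relation * p_tail) ** (1.0 / 3.0)

def log_belief_probability(p_head, p_relation, p_tail):
    """Same quantity in log space: the average of the three log-probabilities."""
    return (math.log(p_head) + math.log(p_relation) + math.log(p_tail)) / 3.0

# Toy values standing in for Pr(h|r,t), Pr(r|h,t,m) and Pr(t|h,r).
p = belief_probability(0.9, 0.6, 0.8)
log_p = log_belief_probability(0.9, 0.6, 0.8)
```

Taking the logarithm turns the cube root of a product into an average of log terms, which is why each factor can be modeled separately.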

We assume that we have a certain repository \Delta, such as WordNet, which
contains thousands of beliefs validated by experts. The learning objective is
intuitively set to maximize \mathcal{L}_{max}, where

    \mathcal{L}_{max} = \prod_{(h,r,t,m) \in \Delta} \Pr(h, r, t, m).    (2)

In most cases, we can also automatically build much larger but imperfect
knowledge bases via crowdsourcing (Freebase) and machine learning techniques
(NELL). However, each belief of NELL has a confidence-weighted score c to
indicate its plausibility to some extent. Therefore, we propose an alternative
goal which aims to minimize \mathcal{L}_{min}, in which

    \mathcal{L}_{min} = \prod_{(h,r,t,m,c) \in \Delta} \frac{1}{2} [\Pr(h, r, t, m) - c]^2.    (3)

To facilitate the optimization process, we prefer using the log likelihood of
\mathcal{L}_{max} and \mathcal{L}_{min}, and the learning targets can be
further processed as follows,

    \arg\max_{h,r,t,m} \log \mathcal{L}_{max}
      = \arg\max_{h,r,t,m} \sum_{(h,r,t,m) \in \Delta} \log \Pr(h, r, t, m)
      = \arg\max_{h,r,t,m} \sum_{(h,r,t,m) \in \Delta} \frac{1}{3} [\log \Pr(h \mid r, t)
          + \log \Pr(r \mid h, t, m) + \log \Pr(t \mid h, r)];    (4)











[Figure 1 panels: (a) Fragment of Knowledge Graph; (b) Belief Embedding Space;
(c) Snatch of Text Mention, an excerpt beginning "The Toronto Maple Leafs
(officially the Toronto Maple Leaf Hockey Club) is a professional ice hockey
team based in Toronto, Ontario, Canada."]

Figure 1: The whole framework of belief embedding. (a) shows a fragment of a
knowledge graph; (c) is a snatch of Wiki text which describes the knowledge
graph in (a); (b) illustrates how the belief (Maple Leafs, home town, Toronto,
team based in) is projected into the same embedding space.


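    \arg\min_{h,r,t,m} \log \mathcal{L}_{min}
      = \arg\min_{h,r,t,m} \sum_{(h,r,t,m,c) \in \Delta} \frac{1}{2} [\log \Pr(h, r, t, m) - \log c]^2
      = \arg\min_{h,r,t,m} \sum_{(h,r,t,m,c) \in \Delta} \frac{1}{2} \{ \frac{1}{3} [\log \Pr(h \mid r, t)
          + \log \Pr(r \mid h, t, m) + \log \Pr(t \mid h, r)] - \log c \}^2.    (5)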

The advantage of the conversions above is that we can separate the factors out,
compared with Equation (1). Therefore, the remaining challenge is to identify
the approaches to use to model Pr(h|r, t), Pr(r|h, t, m) and Pr(t|h, r).

Pr(r|h, t, m) leverages the evidence from two different resources to predict
the relation. If the co-occurrence of the two entities (h and t) in knowledge
bases is independent of the appearance of the relation mention m in free texts,
we can factorize Pr(r|h, t, m) as shown by Equation (6):

    \Pr(r \mid h, t, m) = \Pr(r \mid h, t)\,\Pr(r \mid m).    (6)

We then need to consider formulating Pr(h|r, t), Pr(r|h, t), Pr(t|h, r) and
Pr(r|m), respectively.

Figure 1(a) illustrates the traditional way of recording knowledge as triplets.
The triplets (h, r, t) can construct a knowledge graph in which entities (h and
t) are nodes and the relation (r) between them is a directed edge from the head
entity (h) to the tail entity (t). This kind of symbolic representation, whilst
being very efficient for storing, is not flexible enough for statistical
learning approaches (Bordes et al., 2011). However, once each of the elements,
including entities and relations in the knowledge repository, is projected into
the same embedding space, we can use:

    \mathcal{D}(h, r, t) = -\|\mathbf{h} + \mathbf{r} - \mathbf{t}\| + \alpha,    (7)

which is a simple vector operation to measure the distance between h + r and t,
where h, r and t are encoded as d-dimensional vectors, and \alpha is the bias
parameter. To estimate the conditional probability of t appearing given h and
r, i.e. Pr(t|h, r), however, we need to adopt the softmax function as follows,

    \Pr(t \mid h, r) = \frac{\exp^{\mathcal{D}(h, r, t)}}{\sum_{t' \in E_t} \exp^{\mathcal{D}(h, r, t')}},    (8)

where E_t is the set of tail entities which contains all possible entities t'
appearing in the tail position. Similarly, we can regard Pr(h|r, t) and
Pr(r|h, t) as

    \Pr(h \mid r, t) = \frac{\exp^{\mathcal{D}(h, r, t)}}{\sum_{h' \in E_h} \exp^{\mathcal{D}(h', r, t)}},    (9)

and

    \Pr(r \mid h, t) = \frac{\exp^{\mathcal{D}(h, r, t)}}{\sum_{r' \in R} \exp^{\mathcal{D}(h, r', t)}},    (10)

in which E_h is the set of head entities which contains all possible entities
h' appearing in the head position, and R is the set of all candidate relations
r'.
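Equations (7) and (8) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the Euclidean norm is assumed because the text does not name the norm, and the entity names and 2-dimensional embeddings are invented toy values.

```python
import math

def score(h, r, t, alpha=0.0):
    """D(h, r, t) = -||h + r - t|| + alpha, using the Euclidean norm (an assumption)."""
    dist = math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))
    return -dist + alpha

def tail_probabilities(h, r, candidate_tails, alpha=0.0):
    """Softmax of D(h, r, t') over all candidate tails, following Equation (8)."""
    scores = {name: score(h, r, vec, alpha) for name, vec in candidate_tails.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Toy 2-dimensional embeddings, invented for illustration.
maple_leafs = [0.2, 0.1]
home_town = [0.3, 0.4]
tails = {"Toronto": [0.5, 0.5], "Montreal": [0.9, 0.1], "Ontario": [0.1, 0.9]}

probs = tail_probabilities(maple_leafs, home_town, tails)
best = max(probs, key=probs.get)
```

The same function covers Equations (9) and (10) if the candidates range over heads or relations instead of tails. The sum over every candidate is what makes the exact softmax costly on large repositories.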

In addition to this, Figure 1(c) shows that free texts can provide fruitful
contexts between two recognized entities, but the one-hot feature space (see
http://en.wikipedia.org/wiki/One-hot) is rather high-dimensional and sparse.
Therefore, we can also project each word in relation mentions into the same
embedding space as entities and relations. To measure the similarity between
the mention m and the corresponding relation r, we adopt the inner product of
their embeddings as shown by Equation (11),

    \mathcal{F}(r, m) = \mathbf{W}^{T} \phi(m)\, \mathbf{r} + \beta,    (11)

where W is the matrix containing the d-dimensional embeddings of the n_v words
in the vocabulary, \phi(m) is the sparse one-hot representation of the mention
indicating the absence or presence of words, r is the embedding of relation r,
and \beta is the bias parameter. Similar to Equations (8), (9) and (10), the
conditional probability of predicting relation r given mention m, i.e. Pr(r|m),
can be defined as,

    \Pr(r \mid m) = \frac{\exp^{\mathcal{F}(r, m)}}{\sum_{r' \in R} \exp^{\mathcal{F}(r', m)}}.    (12)

Overall, this section has shown that we can model the probability of a belief
via jointly embedding the entities, relations and even the words in mentions,
as demonstrated by Figure 1(b).
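The mention-side score of Equations (11) and (12) can be sketched as follows. Multiplying W^T by a presence/absence vector amounts to summing the embeddings of the words that occur in the mention, which is what the code does. The vocabulary, relation names and vectors are hypothetical toy values.

```python
import math

def mention_score(word_embeddings, mention_words, relation_vec, beta=0.0):
    """F(r, m): sum the embeddings of words present in the mention, then dot with r."""
    summed = [0.0] * len(relation_vec)
    for word in set(mention_words):       # presence/absence, so each word counts once
        vec = word_embeddings.get(word)
        if vec is not None:               # out-of-vocabulary words are skipped
            summed = [s + v for s, v in zip(summed, vec)]
    return sum(s * r for s, r in zip(summed, relation_vec)) + beta

def relation_given_mention(word_embeddings, mention_words, relations, beta=0.0):
    """Softmax of F(r', m) over candidate relations, following Equation (12)."""
    scores = {name: mention_score(word_embeddings, mention_words, vec, beta)
              for name, vec in relations.items()}
    top = max(scores.values())
    weights = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Toy word and relation embeddings, invented for illustration.
words = {"team": [0.2, 0.0], "based": [0.5, 0.1], "in": [0.3, 0.0], "born": [0.0, 0.9]}
relations = {"home_town": [1.0, 0.0], "birth_place": [0.0, 1.0]}
mention_probs = relation_given_mention(words, ["team", "based", "in"], relations)
```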


4 Algorithm 

To search for the optimal solutions of Equations (4) and (5), we can use
Stochastic Gradient Descent (SGD) to update the embeddings of entities,
relations and words of mentions in an iterative fashion. However, it is
computationally intensive to calculate the normalization terms in Pr(h|r, t),
Pr(t|h, r), Pr(r|h, t) and Pr(r|m) according to the definitions made by
Equations (9), (8), (10) and (12) respectively. For instance, if we directly
calculate the value of Pr(h|r, t) for just one belief, tens of thousands of
\exp^{\mathcal{D}(h', r, t)} terms need to be re-evaluated, as there are tens
of thousands of candidate entities h' in E_h. Inspired by the work of Mikolov
et al. (2013), we have developed an efficient approach that adopts the negative
sampling technique to approximate the conditional probability functions, i.e.
Equations (8), (9), (10) and (12), by transforming them into binary
classification problems as shown respectively by the subsequent equations,

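    \log \Pr(h \mid r, t) \approx \log \Pr(1 \mid h, r, t)
        + \sum_{i=1}^{k} \mathbb{E}_{h'_i \sim \Pr(h' \in E_h)} \log \Pr(0 \mid h'_i, r, t),    (13)

    \log \Pr(t \mid h, r) \approx \log \Pr(1 \mid h, r, t)
        + \sum_{i=1}^{k} \mathbb{E}_{t'_i \sim \Pr(t' \in E_t)} \log \Pr(0 \mid h, r, t'_i),    (14)

    \log \Pr(r \mid h, t) \approx \log \Pr(1 \mid h, r, t)
        + \sum_{i=1}^{k} \mathbb{E}_{r'_i \sim \Pr(r' \in R)} \log \Pr(0 \mid h, r'_i, t),    (15)

    \log \Pr(r \mid m) \approx \log \Pr(1 \mid r, m)
        + \sum_{i=1}^{k} \mathbb{E}_{r'_i \sim \Pr(r' \in R)} \log \Pr(0 \mid r'_i, m),    (16)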


where we sample k negative beliefs and discriminate them from the positive
case. For the simple binary classification problems mentioned above, we choose
the logistic function with the offset \epsilon shown in Equation (17) to
estimate the probability that the given triplet (h, r, t) is correct:

    \Pr(1 \mid h, r, t) = \frac{1}{1 + \exp^{-\mathcal{D}(h, r, t)}} + \epsilon,    (17)

and with the offset \eta shown in Equation (18) to tell the probability of the
co-occurrence of r and m:

    \Pr(1 \mid r, m) = \frac{1}{1 + \exp^{-\mathcal{F}(r, m)}} + \eta.    (18)
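The negative-sampling approximation of Equation (14) can be sketched as below. This is a hedged illustration, not the authors' code: the offsets of Equations (17) and (18) are left out, Pr(0|.) is taken to be 1 - Pr(1|.), negatives are drawn uniformly from a candidate pool, and the Euclidean norm and all vectors are assumptions made for the example.

```python
import math
import random

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def score(h, r, t, alpha=0.0):
    """D(h, r, t) = -||h + r - t|| + alpha with the Euclidean norm (an assumption)."""
    dist = math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(h, r, t)))
    return -dist + alpha

def approx_log_prob_tail(h, r, t, candidate_tails, k, rng, alpha=0.0):
    """Approximate log Pr(t | h, r) with one positive term and k sampled negatives."""
    positive = log_sigmoid(score(h, r, t, alpha))             # log Pr(1 | h, r, t)
    negative = 0.0
    for _ in range(k):
        t_neg = rng.choice(candidate_tails)                   # uniform sampling (assumption)
        negative += log_sigmoid(-score(h, r, t_neg, alpha))   # log Pr(0 | h, r, t')
    return positive + negative

# Toy embeddings, invented for illustration.
h, r = [0.2, 0.1], [0.3, 0.4]
pool = [[0.9, 0.1], [0.1, 0.9], [-0.5, 0.3]]
good = approx_log_prob_tail(h, r, [0.5, 0.5], pool, k=2, rng=random.Random(0))
bad = approx_log_prob_tail(h, r, [-1.0, -1.0], pool, k=2, rng=random.Random(0))
```

Each evaluation touches only k + 1 candidates instead of the whole entity set, which is the source of the speed-up described above.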


5 Experiment 

Besides being trainable with the efficient SGD algorithm, PBE learns embeddings
that can be applied effectively to multiple subtasks of knowledge completion,
such as entity inference, relation prediction, and triplet classification.


5.1 Dataset 


To demonstrate the wide adaptability and significant effectiveness of our
approach, we prepare three datasets as shown by Table 1, i.e. NELL-50K, WN-100K
and FB-500K, from the repositories of NELL (Carlson et al., 2010b), WordNet
(Miller, 1995) and Freebase (Bollacker et al., 2007; Bollacker et al., 2008)
respectively. NELL (Mitchell et al., 2015), designed and maintained by Carnegie
Mellon University, is an outstanding system which runs 24 hours a day and never
stops learning beliefs from the Web. We use a relatively small dataset,
NELL-50K, which contains about 50 thousand confidence-weighted beliefs from
NELL. Each belief of NELL-50K has a relation mention m in addition to a triplet
(h, r, t). WN-100K is made by experts, and has only 11 kinds of relations but
many more entities. Therefore, it is a sparse repository in which fewer
entities have connections. The third dataset (FB-500K) we adopt was released by
Bordes et al. (2013). It is a large crowdsourced dataset extracted from
Freebase, in which each belief is a triplet without a confidence score.

DATASET                  NELL-50K    WN-100K    FB-500K
#(ENTITY)                  29,904     38,696     14,951
#(RELATION)                   233         11      1,345
#(VOCABULARY)               8,948          -          -
#(TRAINING EX.)            57,356    112,581    483,142
#(VALIDATING EX.)          10,710      5,218     50,000
#(TESTING EX.)             10,711     21,088     59,071
#(TC VALIDATING EX.)       21,420     10,436    100,000
#(TC TESTING EX.)          21,412     42,176    118,142

Table 1: Statistics of the datasets used for the subtasks of knowledge
completion, i.e. entity inference, relation prediction and triplet
classification (TC).

As the words in relation mentions are used in the relation prediction subtask,
we also show the vocabulary size of each dataset. However, WN-100K and FB-500K
only contain triplets as beliefs, so they have no vocabulary.

For the triplet classification subtask, the head or the tail entity can be
randomly replaced with another one to produce a negative training example. But
in order to build much tougher validation and testing datasets, we require that
the picked entity has appeared at the same position in some other triplet. For
example, (Pablo Picasso, nationality, U.S.) is a potential negative example
rather than the obvious nonsense (Pablo Picasso, nationality, Van Gogh), given
a positive triplet (Pablo Picasso, nationality, Spain).


Footnote on FB-500K: We changed the original name of the dataset (FB15K), so as
to follow the naming conventions in our paper. Related studies on this dataset
can be looked up from the website
https://www.hds.utc.fr/everest/doku.php?id=en:transe

5.2 Entity inference

One of the benefits of knowledge embedding is that simple vector operations can
be applied to entity inference, which contributes to knowledge graph
completion. Given an incomplete triplet, like (h, r, ?) or (?, r, t), the
subtask needs to compute \arg\max_{h \in E_h} \Pr(h \mid r, t) with the help of
the entity and relation embeddings. Meanwhile,
\arg\max_{t \in E_t} \Pr(t \mid h, r) will help us to find the best tail entity
given the head entity h and the relation r.

5.2.1 Metric

For each testing belief, all the other entities that appear in the training set
take turns to replace the head entity. Then we get a bunch of candidate
triplets. The plausibility of each candidate triplet is firstly computed by
various scoring functions, such as Pr(h|r, t) in PBE, and then sorted in
ascending order. Finally, we locate the ground-truth triplet and record its
rank. This whole procedure runs in the same way when replacing the tail entity,
so that we can obtain the mean results. We use two metrics, i.e. Mean Rank and
Mean Hit@10 (the proportion of ground-truth triplets that rank in the Top 10),
to measure the performance. However, the results measured by those metrics are
relatively raw, as the procedure above tends to generate false negative
triplets. In other words, some of the candidate triplets rank higher than the
ground-truth triplet just because they also appear in the training set. We thus
filter out those triplets to report more reasonable results.
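The raw and filtered ranking protocol for entity inference can be sketched as follows. The sketch assumes that a higher score means a more plausible candidate (as with a probability); for a dissimilarity-based baseline the comparison would be reversed. The candidate names and scores are invented for illustration.

```python
def rank_of_truth(scores, truth, known_true=()):
    """1-based rank of `truth` among candidates sorted by descending score.

    Candidates in `known_true` (other correct answers seen in training) are
    skipped, which corresponds to the "Filter" setting; leave it empty for "Raw".
    """
    truth_score = scores[truth]
    better = [c for c, s in scores.items() if s > truth_score and c not in known_true]
    return len(better) + 1

def mean_rank_and_hit_at_k(ranks, k=10):
    """Mean Rank and Hit@k over a list of per-belief ranks."""
    mean_rank = sum(ranks) / len(ranks)
    hit = sum(1 for r in ranks if r <= k) / len(ranks)
    return mean_rank, hit

# Toy scores for candidate tails of one test belief (hypothetical values).
scores = {"Spain": 0.30, "France": 0.45, "U.S.": 0.10, "Italy": 0.15}
raw_rank = rank_of_truth(scores, "Spain")
# Suppose the triplet with tail "France" also appears in the training set.
filtered_rank = rank_of_truth(scores, "Spain", known_true={"France"})

mean_rank, hit10 = mean_rank_and_hit_at_k([raw_rank, filtered_rank, 12], k=10)
```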


5.2.2 Performance 

We compare PBE with the state-of-the-art TransH (Wang et al., 2014b), TransM
(Fan et al., 2014b) and TransE (Bordes et al., 2013) mentioned in Section 2,
evaluated on the NELL-50K, WN-100K and FB-500K datasets. We tune the parameters
of each previous model based on the validation set, and select the combination
of parameters which leads to the best performance. Tables 2, 3 and 4
demonstrate that PBE outperforms the prior art on almost all the metrics.
Overall, it achieves significant improvements (relative increments) on all
three datasets, with NELL-50K: {Mean Rank Raw: 4.9% ↑, Hit@10 Raw: 4.2% ↑, Mean
Rank Filter: 3.7% ↑, Hit@10 Filter: 8.3% ↑}, WN-100K: {Mean Rank Raw: 20.3% ↑,
Hit@10 Raw: 136.8% ↑, Mean Rank Filter: 20.5% ↑, Hit@10 Filter: 146.3% ↑} and
FB-500K: {Mean Rank Raw: 15.8% ↑, Hit@10 Raw: 27.3% ↑, Mean Rank Filter:
13.3% ↑, Hit@10 Filter: 10.4% ↑}.

DATASET: NELL-50K
METRIC    MEAN RANK (Raw)    MEAN RANK (Filter)    MEAN HIT@10 (Raw)    MEAN HIT@10 (Filter)
TransE    2,436 / 29,904     2,426 / 29,904        18.9%                19.6%
TransM    2,296 / 29,904     2,285 / 29,904        20.5%                21.3%
TransH    2,185 / 29,904     2,072 / 29,904        21.6%                28.8%
PBE       2,078 / 29,904     1,996 / 29,904        22.5%                26.4%

Table 2: Entity inference results on the NELL-50K dataset. We compared our
proposed PBE with the state-of-the-art method TransH and other prior art
mentioned in Section 2.

DATASET: WN-100K
METRIC    MEAN RANK (Raw)    MEAN RANK (Filter)    MEAN HIT@10 (Raw)    MEAN HIT@10 (Filter)
TransE    10,623 / 38,696    10,575 / 38,696       3.8%                 4.1%
TransM    14,586 / 38,696    13,276 / 38,696       1.8%                 2.0%
TransH    12,542 / 38,696    12,463 / 38,696       2.3%                 2.6%
PBE       8,462 / 38,696     8,409 / 38,696        9.0%                 10.1%

Table 3: Entity inference results on the WN-100K dataset. We compared our
proposed PBE with the state-of-the-art method TransH and other prior art
mentioned in Section 2.

DATASET: FB-500K
METRIC    MEAN RANK (Raw)    MEAN RANK (Filter)    MEAN HIT@10 (Raw)    MEAN HIT@10 (Filter)
TransE    243 / 14,951       125 / 14,951          34.9%                47.1%
TransM    196 / 14,951       93 / 14,951           44.6%                55.2%
TransH    211 / 14,951       84 / 14,951           42.5%                58.5%
PBE       165 / 14,951       61 / 14,951           50.5%                64.6%

Table 4: Entity inference results on the FB-500K dataset. We compared our
proposed PBE with the state-of-the-art method TransH and other prior art
mentioned in Section 2.

DATASET   NELL-50K                     WN-100K                      FB-500K
METRIC    AVG. R.   HIT@10   HIT@1     AVG. R.   HIT@10   HIT@1     AVG. R.   HIT@10   HIT@1
TransE    131.8     16.3%    3.0%      3.8       98.3%    15.1%     762.7     13%      1.9%
TransM    70.2      18.9%    4.3%      4.6       97.5%    14.8%     402.3     13.4%    3.2%
TransH    46.3      20.0%    5.1%      3.4       99.0%    19.3%     79.5      39.2%    15.6%
JRME      6.2       87.8%    60.2%     3.9       99.0%    15.9%     60.9      27.4%    7.2%
PBE       2.5       96.6%    78.3%     2.0       99.1%    72.6%     2.6       97.3%    66.7%

Table 5: Performance of relation prediction for TransE, TransM, TransH, JRME
and PBE, evaluated by the metrics of Average Rank, Hit@10 and Hit@1 on the
NELL-50K, WN-100K and FB-500K datasets.

DATASET   NELL-50K   WN-100K   FB-500K
METRIC    ACC.       ACC.      ACC.
TransE    80.5%      64.2%     79.9%
TransM    82.0%      57.2%     85.8%
TransH    83.6%      59.5%     87.7%
PBE       90.2%      67.8%     92.6%

Table 6: The accuracy of triplet classification compared with several recent
approaches (TransH, TransM and TransE) on the NELL-50K, WN-100K and FB-500K
datasets.
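The relative increments appear to be computed against the best-performing baseline for each metric. A minimal sketch of that arithmetic, using the raw NELL-50K figures from Table 2 (the function name is hypothetical and the formula is our reading of "relative increment"):

```python
def relative_gain(best_baseline, pbe, lower_is_better):
    """Relative improvement of PBE over the best baseline, in percent."""
    if lower_is_better:
        return 100.0 * (best_baseline - pbe) / best_baseline
    return 100.0 * (pbe - best_baseline) / best_baseline

# NELL-50K, raw setting: TransH is the best baseline on both metrics.
mean_rank_gain = relative_gain(2185, 2078, lower_is_better=True)   # Mean Rank: lower is better
hit10_gain = relative_gain(21.6, 22.5, lower_is_better=False)      # Hit@10: higher is better

print(f"{mean_rank_gain:.1f}% {hit10_gain:.1f}%")  # prints "4.9% 4.2%"
```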

5.3 Relation prediction 

The scenario of this subtask is as follows: given a pair of entities and the
text mention indicating the semantic relation between them, i.e. (h, ?, t, m),
this subtask computes \arg\max_{r \in R} \Pr(r \mid h, t)\,\Pr(r \mid m) to
predict the best relations.
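The combination of the two evidence sources can be sketched as a product of two probability tables followed by a sort. The probabilities below are invented, and of the three relation names only citylocatedinstate comes from the NELL example in the introduction; the other two are hypothetical.

```python
def rank_relations(p_r_given_entities, p_r_given_mention):
    """Rank candidate relations by Pr(r | h, t) * Pr(r | m), best first."""
    combined = {r: p_r_given_entities[r] * p_r_given_mention[r]
                for r in p_r_given_entities}
    return sorted(combined, key=combined.get, reverse=True)

# Toy probabilities for one (h, ?, t, m) query.
from_entities = {"citylocatedinstate": 0.40, "citycapitalofstate": 0.45, "cityhasairport": 0.15}
from_mention = {"citylocatedinstate": 0.70, "citycapitalofstate": 0.20, "cityhasairport": 0.10}

ranking = rank_relations(from_entities, from_mention)
```

In this toy case the entity evidence alone slightly prefers the wrong relation, and the mention evidence corrects the ranking.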

5.3.1 Metric 

We compare the performance of our model and other state-of-the-art approaches
with the following metrics.

Average Rank: Each candidate relation gains a score calculated by Equation (7).
We sort the candidates in ascending order and compare them with the
corresponding ground-truth belief. For each belief in the testing set, we get
the rank of the correct relation. The average rank is an aggregate indicator
which, to some extent, judges the overall performance of an approach on
relation extraction.

Hit@10: Besides the average rank, practitioners in industry are more concerned
with the accuracy of extraction when selecting the Top 10 relations. This
metric shows the proportion of beliefs for which the correct relation is ranked
in the Top 10.

Hit@1: This is a stricter metric that can be relied on by automatic systems,
since it gives the accuracy when picking just the first predicted relation in
the sorted list.

5.3.2 Performance 

Table 5 shows the results of the relation prediction experiments on all three
datasets. We find that the text mentions within NELL-50K contribute a lot to
predicting the correct relations. All of the results show that PBE performs
best compared with the latest approaches. The relative increments are NELL-50K:
{Average Rank: 59.7% ↑, Hit@10: 10.0% ↑, Hit@1: 30.0% ↑}, WN-100K: {Average
Rank: 41.1% ↑, Hit@10: 0.1% ↑, Hit@1: 276.2% ↑} and FB-500K: {Average Rank:
95.7% ↑, Hit@10: 148.2% ↑, Hit@1: 327.6% ↑}.


5.4 Triplet classification 


Triplet classification is another inference-related task, proposed by Socher et
al. (2013), which focuses on searching for a relation-specific threshold
\sigma_r to identify whether a triplet (h, r, t) is plausible. If the
probability of a testing triplet (h, r, t) computed by
\Pr(h \mid r, t)\,\Pr(r \mid h, t)\,\Pr(t \mid h, r) is above the
relation-specific threshold \sigma_r, it is predicted as positive, otherwise
negative.


5.4.1 Metric 

We use classification accuracy to compare the performance of the competing
methods. Specifically, we count the correctly classified triplets (h, r, t) by
comparing the probability of each triplet with the relation-specific threshold
\sigma_r, which is found by maximizing the classification accuracy on the
validation triplets that belong to the relation r.
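The threshold search for a single relation can be sketched as a scan over the observed validation scores. This sketch assumes that a higher probability means a more plausible triplet, so a triplet is accepted when its probability reaches the threshold; the validation pairs are invented.

```python
def best_threshold(scored_examples):
    """Pick the threshold that maximizes accuracy on one relation's validation set.

    scored_examples: list of (probability, is_positive) pairs.
    A triplet is predicted positive when probability >= threshold.
    Returns (threshold, accuracy).
    """
    candidates = sorted({p for p, _ in scored_examples})
    best_thr, best_acc = None, -1.0
    for thr in candidates:
        correct = sum(1 for p, label in scored_examples if (p >= thr) == label)
        acc = correct / len(scored_examples)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr, best_acc

# Toy validation triplets for a single relation (hypothetical scores and labels).
validation = [(0.9, True), (0.7, True), (0.4, False),
              (0.6, False), (0.65, True), (0.2, False)]
threshold, accuracy = best_threshold(validation)
```

The search is repeated independently for every relation, and the resulting thresholds are then applied to the test triplets of the same relation.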


5.4.2 Performance 

Compared with several of the latest approaches, i.e. TransH (Wang et al.,
2014b), TransM (Fan et al., 2014b) and TransE (Bordes et al., 2013), the
proposed PBE approach still outperforms them, with improvements of NELL-50K:
{Accuracy: 7.9% ↑}, WN-100K: {Accuracy: 5.6% ↑} and FB-500K: {Accuracy:
5.6% ↑}, as shown in Table 6.


6 Conclusion 

This paper proposed an elegant probabilistic model to tackle the problem of
embedding beliefs which contain both structured knowledge and unstructured free
texts, by firstly measuring the probability of a given belief (h, r, t, m). To
efficiently learn the embeddings for each entity, relation, and word in
mentions, we also adopted the negative sampling technique to transform the
original model, and presented an algorithm based on stochastic gradient descent
to search for the optimal solution. Extensive knowledge completion experiments,
including entity inference, relation prediction and triplet classification,
showed that our approach achieves significant improvements over other
state-of-the-art methods when tested on three large-scale repositories.